retry on serialization errors #1342
Conversation
Force-pushed from 19e6394 to 6f3a4db
Thank you for contributing -- some comments inline.
To be honest, I don't completely understand the issue yet. If, as indicated in your issue, everything works fine without the serialization requirement, since the ids are always distinct, why can't that be serialized? 🤔
storage/sql/sql.go
Outdated
err = tx.Commit()
if err != nil {
	if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
[nit] Can we do something like pqErr.Code.Name() == "serialization_failure", using Name and the string representation? It would be a tad more explicit, I suppose.
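For reference, a minimal sketch of the suggested check (assuming the canonical github.com/lib/pq import path; the helper name is illustrative). lib/pq's ErrorCode.Name method maps the SQLSTATE to its symbolic name, so the two spellings test the same condition:

```go
package pqcheck

import "github.com/lib/pq"

// isSerializationFailure reports whether err is PostgreSQL SQLSTATE 40001.
// pqErr.Code == "40001" and pqErr.Code.Name() == "serialization_failure"
// are equivalent; the symbolic name is self-documenting.
func isSerializationFailure(err error) bool {
	pqErr, ok := err.(*pq.Error)
	return ok && pqErr.Code.Name() == "serialization_failure"
}
```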
Ah yeah that's much better.
storage/sql/sql.go
Outdated
	return err
}
defer tx.Rollback()
for {
❓ This is addressing an edge case, but can we limit the number of retries? Also, do we care that it's hammering or does this need some wait times?
Hmmm, to be honest I'm not sure. The Postgres docs just say "retry" but don't say anything about backoff and attempts.
I'm a little nervous about limiting it as it could potentially mean bubbling up the error, given enough concurrency. I agree that it's scary to have it unbounded, but so long as we're getting serialization failures, I'm not sure there's any resolution other than to retry until it works (which should eventually happen as requests calm down). It helps that we're at least catching this very specific error code and only retrying on that one.
Not sure about the hammering/wait times. In my experience it's sufficient to just try again immediately to resolve concurrency "disputes", since they tend to resolve fairly quickly, and the net effect is usually just a few more queries (which the database should be able to handle). If it were a failure that could come from load or network connectivity, I'd totally agree, but this almost feels more like regular handling of database behavior. 🤷‍♂️
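For concreteness, here is a sketch of the bounded-retry-with-backoff alternative being weighed above (not what this PR does; the attempt cap, the backoff schedule, and all helper names are illustrative assumptions, and the canonical github.com/lib/pq import path is assumed):

```go
package sqlretry

import (
	"database/sql"
	"time"

	"github.com/lib/pq"
)

// execTxOnce runs fn in a single serializable-transaction attempt.
func execTxOnce(db *sql.DB, fn func(*sql.Tx) error) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded
	if _, err := tx.Exec("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE"); err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}

func isSerializationFailure(err error) bool {
	pqErr, ok := err.(*pq.Error)
	return ok && pqErr.Code.Name() == "serialization_failure"
}

// execTxWithRetry caps attempts and adds a small linear backoff; after
// maxAttempts the serialization failure bubbles up to the caller, which
// is exactly the trade-off discussed above.
func execTxWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	const maxAttempts = 5 // illustrative cap
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = execTxOnce(db, fn); err == nil || !isSerializationFailure(err) {
			return err
		}
		time.Sleep(time.Duration(attempt) * 10 * time.Millisecond) // 10ms, 20ms, ...
	}
	return err
}
```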
storage/sql/sql.go
Outdated
}
defer tx.Rollback()
for {
	tx, err := db.Begin()
I wonder if it would be worth switching to BeginTx(context.TODO(), sql.LevelSerializable), which would, I think, mean we can ditch the explicit SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.
I suppose if we did that we could also go all-in on contexts and do something like
ctx, cancel := context.WithCancel(context.TODO())
defer cancel() // will rollback transaction
tx, err := db.BeginTx(ctx, nil /* or some proper level? */)
if err != nil {
return err
}
if err := fn(tx); err != nil {
return err
}
err = tx.Commit()
// etc
Neat, didn't know that was a thing! I'll try it out.
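For the record, database/sql's BeginTx takes a *sql.TxOptions rather than a bare isolation level, so the suggestion above would look something like this sketch (function name illustrative):

```go
package sqltx

import (
	"context"
	"database/sql"
)

// beginSerializable asks the driver for a serializable transaction via
// TxOptions, replacing the explicit SET TRANSACTION ISOLATION LEVEL statement.
func beginSerializable(ctx context.Context, db *sql.DB) (*sql.Tx, error) {
	return db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
}
```

With this, the driver emits the isolation clause itself, which is why the lib/pq bump later in this thread matters.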
Force-pushed from 6f3a4db to 2c294ca
@srenatus Yeah, I'm a bit confused by it too. Maybe the failures have something to do with concurrent reads, not just writes? There's also the periodic auth-request garbage collection.
I think I found a problem. Let's hold off on merging for now - will have to take another swing at this.
@srenatus Alright, I've pushed another commit that refactors all the error wrapping out of the ExecTx callback. Along the way I tried to keep the error wrapping/handling consistent across all the storage code. With this all cleaned up, I've got our integration tests successfully running in a loop with ~23 concurrent logins.
storage/sql/sql.go
Outdated
if err := fn(tx); err != nil {
	if pqErr, ok := err.(*pq.Error); ok && pqErr.Code.Name() == "serialization_failure" {
		// serialization error; retry
In the initial bug report you mentioned:
The Postgres docs say "applications using this level must be prepared to retry transactions due to serialization failures", which Dex doesn't seem to be doing, and it doesn't feel right to force the client to retry API/login requests due to this, and I don't think they'd even have enough information to recognize that it's a serialization failure to retry in the first place.
Can you add a link to those docs here?
Sure!
@@ -114,7 +114,7 @@ import:
 - package: github.com/mattn/go-sqlite3
   version: 3fb7a0e792edd47bf0cf1e919dfc14e2be412e15
 - package: github.com/lib/pq
-  version: 50761b0867bd1d9d069276790bcd4a3bccf2324a
+  version: 9eb73efc1fcc404148b56765b0d3f61d9a5ef8ee
❓ There's a lot going on between these two commits -- is there anything in particular that you're after, or is this a chore-type update?
This was necessary for BeginTx with transaction level support: lib/pq@ed1a26d
storage/sql/sql.go
Outdated
@@ -74,6 +75,11 @@ var (
}

if err := fn(tx); err != nil {
	if pqErr, ok := err.(*pq.Error); ok && pqErr.Code.Name() == "serialization_failure" {
💭 So, the serialization_failure error potentially comes back from the fn(tx), and not only from the Commit below, as previously thought? Also, do we know if it ever comes back from the tx.Commit()?
Yes, I saw it come from both. I believe it just comes down to timing.
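Since the error can surface at either point, a retry loop has to guard both sites. A minimal sketch (helper and function names illustrative, not necessarily the final Dex code):

```go
package sqltx

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

func isSerializationFailure(err error) bool {
	pqErr, ok := err.(*pq.Error)
	return ok && pqErr.Code.Name() == "serialization_failure"
}

// execTx retries whenever fn(tx) or tx.Commit() reports SQLSTATE 40001,
// the two sites observed to return it in this thread.
func execTx(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
		if err != nil {
			return err
		}
		if err := fn(tx); err != nil {
			tx.Rollback() // best-effort; rollback error ignored in this sketch
			if isSerializationFailure(err) {
				continue // a statement inside the callback hit 40001
			}
			return err
		}
		if err := tx.Commit(); err != nil {
			if isSerializationFailure(err) {
				continue // the conflict surfaced only at commit time
			}
			return err
		}
		return nil
	}
}
```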
@vito thanks for hanging on to this 👍 Looks like in the meantime, some other PRs have gotten in this PR's way (conflicts).
@srenatus Yep, just working on fixing the dependencies up now and adding a comment per @ericchiang's suggestion. Will have it ready soon.
Force-pushed from 1811c16 to 6d3dbc8
OK, all rebased and good to go!
LGTM. I still don't understand the issue as well as I'd like, but this seems to be an improvement. I've yet to see these in the wild; but your test suite agrees, so 👍
💭 I wonder if your tests depend on your infrastructure/stack, or could become part of github.com/dexidp/dex, too? 😉
//
// "applications using this level must be prepared to retry transactions due to
// serialization failures"
func isRetryableSerializationFailure(err error) bool {
ℹ️ This is probably good to go, but I'd like to take a dive into PG doc land to wrap my head around this some more. However, I'll do that tomorrow morning, probably. In the meantime, there's not going to be a release out of thin air, so I'm fine either way -- shall we merge this now or later?
@srenatus Doesn't matter to me, feel free to take your time. We've still got a few things on our fork so we won't be blocked on a merge either way.
also use a context for the rollback, which is a bit cleaner since it only results in one 'defer', rather than N from the loop
prior to this change, many of the functions in the ExecTx callback would wrap the error before returning it. this made it impossible to check for the error code. instead, the error wrapping has been moved to be external to the `ExecTx` callback, so that the error code can be checked and serialization failures can be retried.
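A sketch of the pitfall that commit message describes: fmt.Errorf with %v returns a new error value, so a later err.(*pq.Error) type assertion can no longer see the driver error (the query and names below are illustrative):

```go
package errwrap

import (
	"database/sql"
	"fmt"
)

// Broken: wrapping inside the callback hides the *pq.Error from the
// retry loop, so serialization failures can never be detected.
func updateBroken(tx *sql.Tx) error {
	if _, err := tx.Exec(`UPDATE auth_request SET claims = $1`, "{}"); err != nil {
		return fmt.Errorf("update auth request: %v", err) // *pq.Error lost here
	}
	return nil
}

// Fixed: the callback returns the raw error so the retry loop can inspect
// its code; wrapping happens only after the retry decision has been made.
func updateFixed(execTx func(func(*sql.Tx) error) error) error {
	err := execTx(func(tx *sql.Tx) error {
		_, err := tx.Exec(`UPDATE auth_request SET claims = $1`, "{}")
		return err
	})
	if err != nil {
		return fmt.Errorf("update auth request: %v", err)
	}
	return nil
}
```

(Today one could wrap with %w and check via errors.As, but at the time of this PR a direct type assertion was the idiom, so raw errors had to reach the retry check.)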
Force-pushed from 6d3dbc8 to 85dd068
I've tried replicating the failures. And, while I was able to trigger the condition using concurrent updates, I have no idea why these happen. This isn't entirely satisfying, but since the internet says that there doesn't have to be a serialization failure for these retryable errors to occur, and applications need to be aware of this, I think this is good to go.
Oh, forgot to mention: using this branch, those failures, while still in PG's logs (of course), didn't matter anymore, as the retry was successful. 👍 ℹ️ What I've done locally is quick and dirty, so nothing to commit at this moment.
…ation-error retry on serialization errors
bump lib/pq taken from @vito's dexidp#1342.
Fixes #1341
Turns out it's easy enough to just capture the specific serialization error code and retry the transaction.
Tested this locally - the storage conformance tests pass (they failed when removing serialization altogether), and our integration tests no longer fail on concurrent logins.